open-source dataset
OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking Supplementary Material
Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?
In the current task of open-vocabulary multi-object tracking (OVMOT), there is only one benchmark available, which lacks high-quality, large-scale datasets. The existing dataset suffers from several limitations, including insufficient categories, limited video data, and a significant imbalance between base classes and novel classes. These deficiencies make it inadequate for supporting the evaluation of new OVMOT models. Our proposed dataset aims to provide a more comprehensive evaluation platform for the OVMOT task.

Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
This dataset was constructed by collecting and extracting data from seven other datasets and applying unified annotations. This work was completed by Haiji Liang and Ruize Han.

Who funded the creation of the dataset?
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech
Borisov, Maksim, Spirin, Egor, Diatlova, Daria
Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.
Alleviating Attack Data Scarcity: SCANIA's Experience Towards Enhancing In-Vehicle Cyber Security Measures
Sundfeldt, Frida, Widstam, Bianca, Moghadam, Mahshid Helali, Liang, Kuo-Yun, Vesterberg, Anders
The digital evolution of connected vehicles and the subsequent security risks emphasize the critical need for implementing in-vehicle cyber security measures such as intrusion detection and response systems. The continuous advancement of attack scenarios further highlights the need for adaptive detection mechanisms that can detect evolving, unknown, and complex threats. The effective use of ML-driven techniques can help address this challenge. However, constraints on implementing diverse attack scenarios on test vehicles due to safety, cost, and ethical considerations result in a scarcity of data representing attack scenarios. This limitation necessitates alternative efficient and effective methods for generating high-quality attack-representing data. This paper presents a context-aware attack data generator that generates attack inputs and the corresponding in-vehicle network logs, i.e., controller area network (CAN) logs, representing various types of attack, including denial of service (DoS), fuzzy, spoofing, suspension, and replay attacks. It utilizes parameterized attack models augmented with CAN message decoding and attack intensity adjustments to configure the attack scenarios with high similarity to real-world scenarios and promote variability. We evaluate the practicality of the generated attack-representing data within an intrusion detection system (IDS) case study, in which we develop and perform an empirical evaluation of two deep neural network IDS models using the generated data. In addition to the efficiency and scalability of the approach, the performance results of the IDS models, which show high detection and classification capabilities, validate the consistency and effectiveness of the generated data. In this experience study, we also elaborate on the aspects influencing the fidelity of the data to real-world scenarios and provide insights into its application.
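To make the five attack types named in the abstract concrete, the following is a minimal Python sketch of parameterized attack injection into a CAN log. It is not the authors' generator: the `CanFrame` record, the `inject_*` function names, the integer-millisecond timestamps, and the use of `rate_hz` as a stand-in for "attack intensity" are all assumptions made for illustration, and the sketch omits the CAN message decoding and context awareness the abstract describes.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanFrame:
    """Hypothetical log record; field names are illustrative only."""
    timestamp_ms: int  # integer milliseconds keep the arithmetic exact
    can_id: int        # 11-bit arbitration ID (0x000-0x7FF)
    data: bytes        # payload, up to 8 bytes


def _burst(start_ms: int, duration_ms: int, rate_hz: int) -> List[int]:
    """Timestamps of frames sent at rate_hz during the attack window."""
    count = duration_ms * rate_hz // 1000
    return [start_ms + i * 1000 // rate_hz for i in range(count)]


def _merge(log: List[CanFrame], attack: List[CanFrame]) -> List[CanFrame]:
    """Interleave attack frames with the benign log in time order."""
    return sorted(log + attack, key=lambda f: f.timestamp_ms)


def inject_dos(log, start_ms, duration_ms, rate_hz):
    """DoS: flood the bus with the highest-priority ID (0x000)."""
    attack = [CanFrame(t, 0x000, bytes(8))
              for t in _burst(start_ms, duration_ms, rate_hz)]
    return _merge(log, attack)


def inject_fuzzy(log, start_ms, duration_ms, rate_hz, seed=0):
    """Fuzzy: frames with random IDs and random payloads."""
    rng = random.Random(seed)  # seeded so a scenario can be reproduced
    attack = [CanFrame(t, rng.randrange(0x800),
                       bytes(rng.randrange(256) for _ in range(8)))
              for t in _burst(start_ms, duration_ms, rate_hz)]
    return _merge(log, attack)


def inject_spoofing(log, target_id, payload, start_ms, duration_ms, rate_hz):
    """Spoofing: forged payloads sent under a legitimate ID."""
    attack = [CanFrame(t, target_id, payload)
              for t in _burst(start_ms, duration_ms, rate_hz)]
    return _merge(log, attack)


def inject_suspension(log, target_id, start_ms, duration_ms):
    """Suspension: frames of one ID disappear during the window."""
    end_ms = start_ms + duration_ms
    return [f for f in log
            if not (f.can_id == target_id
                    and start_ms <= f.timestamp_ms < end_ms)]


def inject_replay(log, src_start_ms, duration_ms, delay_ms):
    """Replay: re-send a recorded window of traffic delay_ms later."""
    src_end_ms = src_start_ms + duration_ms
    replayed = [CanFrame(f.timestamp_ms + delay_ms, f.can_id, f.data)
                for f in log
                if src_start_ms <= f.timestamp_ms < src_end_ms]
    return _merge(log, replayed)
```

In this sketch, raising `rate_hz` or lengthening `duration_ms` is one simple way to vary intensity; a real generator would also label each injected frame so that an IDS can be trained and scored against ground truth.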
HelpSteer 2: Open-source dataset for training top-performing reward models
High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful Nemotron-4-340B base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. Additionally, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models.
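As a rough illustration of how a multi-attribute score can be turned into a single preference signal, the Python sketch below collapses per-attribute scores into one scalar with a weighted sum and compares two responses. It is not SteerLM 2.0 and not the authors' reward model. The attribute names are an assumption (the abstract does not list them), and the weights and example scores are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from typing import Mapping

# Hypothetical weights, for illustration only; a negative weight
# penalizes an attribute instead of rewarding it.
HYPOTHETICAL_WEIGHTS = {
    "helpfulness": 1.0,
    "correctness": 0.5,
    "coherence": 0.5,
    "complexity": 0.0,
    "verbosity": -0.25,
}


def scalar_reward(scores: Mapping[str, float],
                  weights: Mapping[str, float] = HYPOTHETICAL_WEIGHTS) -> float:
    """Weighted sum of per-attribute scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing attribute scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def preferred(scores_a: Mapping[str, float],
              scores_b: Mapping[str, float],
              weights: Mapping[str, float] = HYPOTHETICAL_WEIGHTS) -> str:
    """Return 'a' or 'b' for the higher scalar reward (ties go to 'a')."""
    reward_a = scalar_reward(scores_a, weights)
    reward_b = scalar_reward(scores_b, weights)
    return "a" if reward_a >= reward_b else "b"


# Made-up scores for two candidate responses to the same prompt.
response_a = {"helpfulness": 4, "correctness": 4, "coherence": 4,
              "complexity": 2, "verbosity": 2}
response_b = {"helpfulness": 3, "correctness": 2, "coherence": 4,
              "complexity": 1, "verbosity": 4}
```

With these made-up numbers, `scalar_reward(response_a)` is 7.5 and `scalar_reward(response_b)` is 5.0, so `preferred(response_a, response_b)` returns `"a"`. Keeping the attributes separate until this last step is what lets the weights be adjusted without retraining the scorer.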